DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
Authors:
Xin Jin
Abstract:
DistServe improves the performance of large language models (LLMs) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across all users and requests. We find that this strategy not only leads to strong prefill-decoding interferences but also couples the resource allocation and parallelism plans for both phases. LLM applications often emphasize individual latency for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) of each request for the decoding phase. In the presence of stringent latency requirements, existing systems have to prioritize one latency over the other, or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, hence eliminating prefill-decoding interferences. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored for each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication caused by disaggregation. As a result, DistServe significantly improves LLM serving performance in terms of the maximum rate that can be served within both TTFT and TPOT constraints on each GPU. Our evaluations show that on various popular LLMs, applications, and latency requirements, DistServe can serve 7.4x more requests or 12.6x tighter SLO, compared to state-of-the-art systems, while staying within latency constraints for > 90% of requests.
Summary
Prefill-decode (PD) disaggregation
Motivation
- Interference between prefill and decode
Chunked prefill introduces additional memory accesses, so it mitigates but does not remove the interference
Prefill and decode have different compute characteristics; coupling them on the same GPUs can lead to over-provisioning of the servers
Batching:
Prefill quickly becomes compute-bound, so it is meaningful to find the token-count threshold at which the GPU saturates (see the sketch below)
Decode rarely reaches the compute-bound regime, so the larger the batch size the better; one decoding instance can serve multiple prefill instances
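One way to see this threshold is a micro-benchmark that pushes increasingly large token batches through the GEMMs that dominate a transformer layer and watches where throughput flattens. The sketch below is illustrative only: the hidden/FFN sizes and the `prefill_step` helper are made-up stand-ins, not DistServe's profiler.

```python
# Hypothetical micro-benchmark: throughput vs. number of tokens in one prefill batch.
# Requires a CUDA GPU; sizes roughly follow a 7B-class model layer and are illustrative.
import time
import torch

def prefill_step(num_tokens: int, hidden: int = 4096, ffn: int = 11008) -> float:
    x = torch.randn(num_tokens, hidden, device="cuda", dtype=torch.float16)
    w1 = torch.randn(hidden, ffn, device="cuda", dtype=torch.float16)
    w2 = torch.randn(ffn, hidden, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        _ = (x @ w1) @ w2            # proxy for the MLP GEMMs that dominate prefill compute
    torch.cuda.synchronize()
    return 10 * num_tokens / (time.perf_counter() - start)   # tokens per second

for n in (128, 256, 512, 1024, 2048, 4096):
    print(f"{n:5d} tokens -> {prefill_step(n):10.0f} tok/s")
# Throughput climbs while the GPU is under-utilized and flattens once prefill becomes
# compute-bound; the knee of that curve is the threshold worth profiling. Decode steps
# (one token per request) stay far below it, which is why decode keeps benefiting from
# larger batches.
```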
Prefill parallelism strategy
Inter-op parallelism is essentially pipeline parallelism (PP) and intra-op parallelism is essentially tensor parallelism (TP). The paper uses queueing theory to show that at low request rates TP is faster because it shortens execution time, whereas at high rates PP becomes faster because its shorter per-stage service time reduces queueing delay (a toy model follows below).
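The intuition can be reproduced with a back-of-the-envelope M/D/1 queueing model (Poisson arrivals, deterministic service). The formulas and constants below are a simplified illustration of that argument, not the paper's exact derivation.

```python
# Toy M/D/1 comparison of intra-op (TP) vs. inter-op (PP) prefill on 2 GPUs.
def md1_wait(rate: float, service: float) -> float:
    """Mean queueing delay of an M/D/1 queue with arrival rate `rate` and service time `service`."""
    rho = rate * service
    assert rho < 1.0, "arrival rate exceeds the queue's capacity"
    return rho * service / (2.0 * (1.0 - rho))

def avg_ttft_intra(rate: float, D: float, K: float = 1.6) -> float:
    # Tensor parallelism shrinks execution time to D/K (K < 2 because of communication overhead).
    s = D / K
    return s + md1_wait(rate, s)

def avg_ttft_inter(rate: float, D: float) -> float:
    # A 2-stage pipeline keeps per-request execution at ~D, but the queue drains
    # at the per-stage service time D/2, so the waiting term grows much more slowly.
    return D + md1_wait(rate, D / 2.0)

D = 0.2  # seconds for one prefill on a single GPU, illustrative
for r in (1, 3, 5, 6, 7):  # requests per second
    print(f"rate={r}: intra={avg_ttft_intra(r, D):.3f}s  inter={avg_ttft_inter(r, D):.3f}s")
# Low rate: intra-op (TP) wins thanks to shorter execution.
# High rate: inter-op (PP) wins thanks to shorter queueing.
```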
Decode parallelism strategy
Design
PD disaggregation introduces the overhead of transferring the KV cache from prefill nodes to decode nodes. DistServe therefore provides separate placement algorithms for clusters with and without a high-speed cross-node network.
For clusters with InfiniBand between instances, no constraint is imposed: DistServe enumerates every possible TP/PP configuration, finds the one with the highest per-GPU goodput under the SLOs, and then replicates that parallelism plan according to the traffic. The prefill and decode phases may use different parallelism strategies.
For clusters without InfiniBand between instances, DistServe adds a constraint: the prefill and decode replicas of the same layers must reside on the same node, which avoids expensive cross-node transfers. Consequently, the PP strategies of prefill and decode must be identical; for simplicity, their resource allocation (TP strategy) is also set to be identical. A sketch of the search is given below.
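A hedged sketch of the per-phase enumeration described above (the high-bandwidth case; the low-bandwidth variant would additionally pin the prefill and decode replicas of the same layers to one node). `sim_goodput` is a made-up placeholder for the paper's simulator, not DistServe's actual API.

```python
# Sketch of per-phase placement search: enumerate (pp, tp) splits that fit on one node
# and keep the plan with the best goodput per GPU under the TTFT/TPOT SLOs.
from itertools import product
from typing import Callable, Optional, Tuple

def plan_phase(sim_goodput: Callable[[int, int], float],
               gpus_per_node: int = 8) -> Tuple[Optional[Tuple[int, int]], float]:
    best_cfg, best_per_gpu = None, 0.0
    for pp, tp in product(range(1, gpus_per_node + 1), repeat=2):
        if pp * tp > gpus_per_node:
            continue                      # the plan must fit within one node's GPUs
        per_gpu = sim_goodput(pp, tp) / (pp * tp)
        if per_gpu > best_per_gpu:
            best_cfg, best_per_gpu = (pp, tp), per_gpu
    return best_cfg, best_per_gpu

# Prefill and decode are planned independently (their best (pp, tp) may differ),
# and each chosen plan is then replicated until the expected request rate is covered.
```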
To reduce pipeline bubbles, DistServe profiles the number of tokens at which prefill saturates GPU compute and uses this as the upper bound on the tokens in a single batch (see the packing sketch below).
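Packing pending prompts up to such a profiled budget could look like the sketch below; `Request`, `prompt_len`, and the budget value are illustrative names under my assumptions, not DistServe's API.

```python
# Greedy packing of queued prompts into prefill batches capped at a profiled token budget,
# so that every pipeline stage does a similar amount of work and bubbles stay small.
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    prompt_len: int   # number of prompt tokens to prefill

def make_prefill_batches(pending: List[Request], token_budget: int) -> List[List[Request]]:
    batches, batch, used = [], [], 0
    for req in pending:
        if batch and used + req.prompt_len > token_budget:
            batches.append(batch)
            batch, used = [], 0
        batch.append(req)
        used += req.prompt_len
    if batch:
        batches.append(batch)
    return batches

# Example: with a (made-up) budget of 2048 tokens, three 900-token prompts
# split into a 1800-token batch followed by a 900-token batch.
print([len(b) for b in make_prefill_batches([Request(900)] * 3, token_budget=2048)])
```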
To avoid OOM on decode nodes, DistServe hands control of the KV-cache transfer to the decode side: a decode node pulls the KV cache of a request whose prefill has finished on demand, once it has room for it (a toy sketch follows).
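A toy version of that pull model is sketched below; `fetch_kv`, `kv_blocks`, and the block accounting are hypothetical. The only point it illustrates is that the decode side initiates the copy when it has free memory, so it is never pushed more KV cache than it can hold.

```python
# Toy pull-based KV-cache handoff: the decode instance decides when to fetch.
import asyncio
from dataclasses import dataclass

@dataclass
class FinishedPrefill:
    req_id: int
    kv_blocks: int   # KV-cache size of this request, in memory blocks

class DecodeInstance:
    def __init__(self, free_blocks: int):
        self.free_blocks = free_blocks

    async def admit(self, done_queue: "asyncio.Queue[FinishedPrefill]", fetch_kv) -> FinishedPrefill:
        req = await done_queue.get()                 # a request whose prefill has completed
        while self.free_blocks < req.kv_blocks:      # back-pressure: wait until memory frees up
            await asyncio.sleep(0.001)
        self.free_blocks -= req.kv_blocks
        await fetch_kv(req)                          # the decode side initiates the actual copy
        return req
```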
DistServe also supports resetting the scheduling strategy (re-planning the placement) when needed.
Evaluation
Experimental setup
Input: request arrivals follow a Poisson distribution with different request rates (a trace-generation sketch is given below)
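Generating such an arrival trace is straightforward: draw exponential inter-arrival gaps. The function name and parameters below are illustrative, not the paper's benchmarking harness.

```python
# Generate request arrival timestamps for a Poisson process with a given average rate.
import random
from typing import List

def poisson_arrivals(rate_per_s: float, duration_s: float, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)   # exponential inter-arrival gap, mean 1/rate
        if t >= duration_s:
            return times
        times.append(t)

# e.g. replay a 60-second trace at 4 requests/second against the serving system
timestamps = poisson_arrivals(rate_per_s=4.0, duration_s=60.0)
print(len(timestamps), "requests")
```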
Three workloads
- ChatBot
  - Strict TPOT requirement, and TTFT is also constrained
- Code Completion
  - Stricter TTFT and TPOT requirements than ChatBot
- Summarization
  - Looser TTFT and TPOT requirements
E2E
Comparison from the SLO-attainment perspective, with experiments on all three workloads
- Increase the request rate and observe at what rate the SLO starts to be violated
- Tighten the SLO (scale the latency targets down) and observe at what point it starts to be violated
Latency Breakdown
Measures the actual time spent in each part. Because the placement algorithm tries to keep KV-cache transfers on NVLink, the transfer time is in fact very short.
Ablation Studies
Ablation experiments on the simulator's accuracy and the SLO violation rate.
Algorithm Running Time